1,734 research outputs found

    Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance

    Health surveillance systems based on online user-generated content often rely on the identification of textual markers that are related to a target disease. Given the high volume of available data, these systems benefit from an automatic feature selection process. This is accomplished either by applying statistical learning techniques, which do not consider the semantic relationship between the selected features and the inference task, or by developing labour-intensive text classifiers. In this paper, we use neural word embeddings, trained on social media content from Twitter, to determine, in an unsupervised manner, how strongly textual features are semantically linked to an underlying health concept. We then refine conventional feature selection methods by a priori operating on textual variables that are sufficiently close to a target concept. Our experiments focus on the supervised learning problem of estimating influenza-like illness rates from Google search queries. A "flu infection" concept is formulated and used to reduce spurious and potentially confounding features that were selected by previously applied approaches. In this way, we also address forms of scepticism regarding the appropriateness of the feature space, alleviating potential cases of overfitting. Ultimately, the proposed hybrid feature selection method creates a more reliable model that, according to our empirical analysis, improves the inference performance (Mean Absolute Error) of linear and nonlinear regressors by 12% and 28.7%, respectively
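
The embedding-based filtering step described in this abstract can be illustrated with a minimal sketch. It assumes a simplified similarity measure (cosine similarity to the mean vector of a few concept terms), which is not necessarily the scoring function used in the paper. The toy two-dimensional embeddings, the function names `cosine` and `filter_by_concept`, and the threshold value are all hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_by_concept(features, concept_terms, embeddings, threshold):
    """Keep only features whose embedding is close enough to the concept.

    The concept is represented here by the mean of its term vectors
    (an assumption made for illustration).
    """
    concept = np.mean([embeddings[t] for t in concept_terms], axis=0)
    return [f for f in features if cosine(embeddings[f], concept) >= threshold]

# Toy embeddings (hypothetical values, not trained on real data).
toy_embeddings = {
    "flu": np.array([1.0, 0.0]),
    "fever": np.array([0.8, 0.2]),
    "cough": np.array([0.9, 0.1]),
    "football": np.array([0.0, 1.0]),
}

# Candidate features are filtered before any supervised feature selection.
kept = filter_by_concept(["cough", "football"], ["flu", "fever"],
                         toy_embeddings, threshold=0.5)
```

In a hybrid scheme of this kind, the surviving features would then be passed to a conventional statistical feature selector or regressor.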

    E-NER - An Annotated Named Entity Recognition Corpus of Legal Text

    Identifying named entities, such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, the creation of which can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.

    Hepatic steatosis and fibrosis: Non-invasive assessment

    Chronic liver disease is a major cause of morbidity and mortality worldwide and usually develops over many years, as a result of chronic inflammation and scarring, resulting in end-stage liver disease and its complications. The progression of disease is characterised by ongoing inflammation and consequent fibrosis, although hepatic steatosis is increasingly being recognised as an important pathological feature of disease, rather than being simply an innocent bystander. However, the current gold standard method of quantifying and staging liver disease, histological analysis by liver biopsy, has several limitations and can have associated morbidity and even mortality. Therefore, there is a clear need for safe and noninvasive assessment modalities to determine hepatic steatosis, inflammation and fibrosis. This review covers key mechanisms and the importance of fibrosis and steatosis in the progression of liver disease. We address non-invasive imaging and blood biomarker assessments that can be used as an alternative to information gained on liver biopsy

    Hepatocellular carcinoma: Review of disease and tumor biomarkers.

    © The Author(s) 2016. Hepatocellular carcinoma (HCC) is a common malignancy and now the second commonest global cause of cancer death. HCC tumorigenesis is relatively silent and patients experience late symptomatic presentation. As the option for curative treatments is limited to early stage cancers, diagnosis in non-symptomatic individuals is crucial. International guidelines advise regular surveillance of high-risk populations but the current tools lack sufficient sensitivity for early stage tumors on the background of a cirrhotic nodular liver. A number of novel biomarkers have now been suggested in the literature, which may reinforce the current surveillance methods. In addition, recent metabonomic and proteomic discoveries have established specific metabolite expressions in HCC, according to Warburg's phenomenon of altered energy metabolism. With clinical validation, a simple and non-invasive test from the serum or urine may be performed to diagnose HCC, particularly benefiting low resource regions where the burden of HCC is highest.

    Retrieval of highly dynamic information in an unstructured peer-to-peer network

    We present a framework for the retrieval of highly dynamic information in an unstructured peer-to-peer network. Non-exhaustive search in an unstructured network is necessarily probabilistic, and we utilize the probably approximately correct (PAC) search architecture to determine the required replication rate for a document in order to guarantee a high probability of retrieval. Once this replication rate is determined, the problem becomes how to replicate a new document across the network to meet this requirement, without overloading the communication capacity of the network. To solve this, we model the problem as rumour spreading, and use techniques from this field to propagate new documents. Our document spreading algorithm is designed such that a document has a very high probability of being replicated to the required number of nodes, but the probability of spreading to fewer or more nodes is small. Apart from facilitating rapid and restrained dissemination, our proposed method also withstands sudden spikes in the data creation rate. We illustrate the utility of the framework in the context of a micro-blogging social network. However, it could also be used to index dynamic web pages in a distributed search engine or for a system which indexes newly created BitTorrents in a decentralized environment. Simulations performed on a network of 100,000 nodes validate our proposed framework.
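
The link between replication rate and retrieval probability mentioned in this abstract can be sketched under a simple assumed model, which may differ from the paper's exact formulation: a query is sent to `nodes_queried` nodes chosen at random, and each queried node independently holds a copy of the document with probability `replicas / n_nodes`. The function names, the number of nodes queried, and the target probability below are hypothetical; only the network size of 100,000 nodes comes from the abstract.

```python
import math

def retrieval_prob(n_nodes, replicas, nodes_queried):
    """Probability that at least one queried node holds the document,
    assuming independent sampling of nodes (an illustrative assumption)."""
    return 1.0 - (1.0 - replicas / n_nodes) ** nodes_queried

def required_replicas(n_nodes, nodes_queried, target_prob):
    """Smallest replica count whose retrieval probability reaches the target.

    Solves 1 - (1 - r/n)^z >= p for r and rounds up.
    """
    frac = 1.0 - (1.0 - target_prob) ** (1.0 / nodes_queried)
    return math.ceil(n_nodes * frac)

# Hypothetical query budget and target probability on a 100,000-node network.
r = required_replicas(n_nodes=100_000, nodes_queried=1_000, target_prob=0.99)
p = retrieval_prob(n_nodes=100_000, replicas=r, nodes_queried=1_000)
```

Once `r` is known, the remaining problem is the one the abstract describes: spreading a new document to roughly `r` nodes without flooding the network.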

    The role of mass spectrometry in hepatocellular carcinoma biomarker discovery

    Hepatocellular carcinoma (HCC) is the main liver malignancy and has a high mortality rate. The discovery of novel biomarkers for early diagnosis, prognosis, and stratification purposes has the potential to alleviate its disease burden. Mass spectrometry (MS) is one of the principal technologies used in metabolomics, with different experimental methods and machine types for different phases of the biomarker discovery process. Here, we review why MS applications are useful for liver cancer, explain the MS technique, and briefly summarise recent findings from metabolomic MS studies on HCC. We also discuss the current challenges and the direction for future research

    Estimating the Population Impact of a New Pediatric Influenza Vaccination Program in England Using Social Media Content

    BACKGROUND: The rollout of a new childhood live attenuated influenza vaccine program was launched in England in 2013, which consisted of a national campaign for all 2 and 3 year olds and several pilot locations offering the vaccine to primary school-age children (4-11 years of age) during the influenza season. The 2014/2015 influenza season saw the national program extended to include additional pilot regions, some of which offered the vaccine to secondary school children (11-13 years of age) as well. OBJECTIVE: We utilized social media content to obtain a complementary assessment of the population impact of the programs that were launched in England during the 2013/2014 and 2014/2015 flu seasons. The overall community-wide impact on transmission in pilot areas was estimated for the different age groups that were targeted for vaccination. METHODS: A previously developed statistical framework was applied, which consisted of a nonlinear regression model that was trained to infer influenza-like illness (ILI) rates from Twitter posts originating in pilot (school-age vaccinated) and control (unvaccinated) areas. The control areas were then used to estimate ILI rates in pilot areas, had the intervention not taken place. These predictions were compared with their corresponding Twitter-based ILI estimates. RESULTS: Results suggest a reduction in ILI rates of 14% (1-25%) and 17% (2-30%) across all ages in only the primary school-age vaccine pilot areas during the 2013/2014 and 2014/2015 influenza seasons, respectively. No significant impact was observed in areas where two age cohorts of secondary school children were vaccinated. CONCLUSIONS: These findings corroborate independent assessments from traditional surveillance data, thereby supporting the ongoing rollout of the program to primary school-age children and providing evidence of the value of social media content as an additional syndromic surveillance tool

    A Concept Language Model for Ad-hoc Retrieval

    We propose an extension to language models for information retrieval. Typically, language models estimate the probability of a document generating the query, where the query is considered as a set of independent search terms. We extend this approach by considering the concepts implied by both the query and words in the document. The model combines the probability of the document generating the concept embodied by the query, and the traditional language model probability of the document generating the query terms. We use a word embedding space to express concepts. The similarity between two vectors in this space is estimated using a weighted cosine distance. The weighting significantly enhances the discrimination between vectors. We evaluate our model on benchmark datasets (TREC 6–8) and empirically demonstrate it outperforms state-of-the-art baselines
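
As a rough illustration of a weighted cosine measure between embedding vectors, the sketch below uses one common per-dimension weighting. The abstract does not specify the weighting scheme, so the formula, the function name `weighted_cosine`, and the example weights are assumptions.

```python
import numpy as np

def weighted_cosine(u, v, w):
    """Cosine similarity in which each dimension i contributes with weight w[i].

    With all weights equal this reduces to the ordinary cosine similarity.
    """
    u, v, w = (np.asarray(x, dtype=float) for x in (u, v, w))
    numerator = np.sum(w * u * v)
    denominator = np.sqrt(np.sum(w * u * u)) * np.sqrt(np.sum(w * v * v))
    return float(numerator / denominator)

# Two hypothetical vectors compared with uniform and non-uniform weights.
uniform = weighted_cosine([1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
reweighted = weighted_cosine([1.0, 1.0], [1.0, 0.0], [0.1, 1.0])
```

Down-weighting the only dimension the two vectors share lowers their similarity, which shows how a weighting can change the separation between vectors.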

    Information-Theoretic Active Learning for Content-Based Image Retrieval

    We propose Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, and apply it for acquiring meaningful user feedback in the context of content-based image retrieval. Instead of combining different heuristics such as uncertainty, diversity, or density, our method is based on maximizing the mutual information between the predicted relevance of the images and the expected user feedback regarding the selected batch. We propose suitable approximations to this computationally demanding problem and also integrate an explicit model of user behavior that accounts for possible incorrect labels and unnameable instances. Furthermore, our approach takes into account not only the structure of the data but also the expected change in model output caused by the user feedback. In contrast to other methods, ITAL turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet. Comment: GCPR 2018 paper (14 pages text + 2 pages references + 6 pages appendix)

    Relationship Between Media Coverage and Measles-Mumps-Rubella (MMR) Vaccination Uptake in Denmark: Retrospective Study

    BACKGROUND: Understanding the influence of media coverage upon vaccination activity is valuable when designing outreach campaigns to increase vaccination uptake. OBJECTIVE: To study the relationship between media coverage and vaccination activity of the measles-mumps-rubella (MMR) vaccine in Denmark. METHODS: We retrieved data on media coverage (1622 articles), vaccination activity (2 million individual registrations), and incidence of measles for the period 1997-2014. All 1622 news media articles were annotated as being provaccination, antivaccination, or neutral. Seasonal and serial dependencies were removed from the data, after which cross-correlations were analyzed to determine the relationship between the different signals. RESULTS: Most (65%) of the anti-vaccination media coverage was observed in the period 1997-2004, immediately before and following the 1998 publication of the falsely claimed link between autism and the MMR vaccine. There was a statistically significant positive correlation between the first MMR vaccine (targeting children aged 15 months) and provaccination media coverage (r=.49, P=.004) in the period 1998-2004. In this period the first MMR vaccine and neutral media coverage also correlated (r=.45, P=.003). However, looking at the whole period, 1997-2014, we found no significant correlations between vaccination activity and media coverage. CONCLUSIONS: Following the falsely claimed link between autism and the MMR vaccine, provaccination and neutral media coverage correlated with vaccination activity. This correlation was only observed during a period of controversy which indicates that the population is more susceptible to media influence when presented with diverging opinions. Additionally, our findings suggest that the influence of media is stronger on parents when they are deciding on the first vaccine of their children, than on the subsequent vaccine because correlations were only found for the first MMR vaccine